Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve multithreaded performance with memory prefetching #861

Merged
merged 7 commits into from
Aug 27, 2024

Conversation

uriyage
Copy link
Contributor

@uriyage uriyage commented Aug 1, 2024

This PR is the last of three pull requests intended to achieve the goal of processing one million requests per second.

1st PR: #758
2nd PR:#763

With these 3 PRs, we can now handle up to 1.2 million requests per second, up from 200K without IO-threads and ~380K with the current IO threads implementation.

This PR utilizes the IO threads to execute commands in batches, allowing us to prefetch the dictionary data in advance.

After making the IO threads asynchronous and offloading more work to them in the first 2 PRs, the lookupKey function becomes a main bottle-neck and it takes about 50% of the main-thread time (Tested with SET command). This is because the Valkey dictionary is a straightforward but inefficient chained hash implementation. While traversing the hash linked lists, every access to either a dictEntry structure, pointer to key, or a value object requires, with high probability, an expensive external memory access.

Memory Access Amortization

(Designed and implemented by dan touitou)

Memory Access Amortization (MAA) is a technique designed to optimize the performance of dynamic data structures by reducing the impact of memory access latency. It is applicable when multiple operations need to be executed concurrently. The principle behind it is that for certain dynamic data structures, executing operations in a batch is more efficient than executing each one separately.

Rather than executing operations sequentially, this approach interleaves the execution of all operations. This is done in such a way that whenever a memory access is required during an operation, the program prefetches the necessary memory and transitions to another operation. This ensures that when one operation is blocked awaiting memory access, other memory accesses are executed in parallel, thereby reducing the average access latency.

We applied this method in the development of dictPrefetch, which takes as parameters a vector of keys and dictionaries. It ensures that all memory addresses required to execute dictionary operations for these keys are loaded into the L1-L3 caches when executing commands. Essentially, dictPrefetch is an interleaved execution of dictFind for all the keys.

Implementation details

When the main thread iterates over the clients-pending-io-read, for clients with ready-to-execute commands (i.e., clients for which the IO thread has parsed the commands), a batch of up to 16 commands is created. Initially, the command's argv, which were allocated by the IO thread, is prefetched to the main thread's L1 cache. Subsequently, all the dict entries and values required for the commands are prefetched from the dictionary before the command execution. Only then will the commands be executed.

Copy link

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 4.46927% with 171 lines in your changes missing coverage. Please review.

Project coverage is 70.27%. Comparing base (1d18842) to head (ebbbed1).
Report is 52 commits behind head on unstable.

Files Patch % Lines
src/memory_prefetch.c 2.45% 159 Missing ⚠️
src/networking.c 15.38% 11 Missing ⚠️
src/io_threads.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #861      +/-   ##
============================================
- Coverage     70.35%   70.27%   -0.09%     
============================================
  Files           112      113       +1     
  Lines         61467    61711     +244     
============================================
+ Hits          43245    43365     +120     
- Misses        18222    18346     +124     
Files Coverage Δ
src/config.c 78.69% <ø> (ø)
src/dict.c 97.54% <100.00%> (+0.25%) ⬆️
src/kvstore.c 96.17% <100.00%> (ø)
src/server.c 88.57% <ø> (+<0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/io_threads.c 7.87% <0.00%> (-0.04%) ⬇️
src/networking.c 88.42% <15.38%> (-0.30%) ⬇️
src/memory_prefetch.c 2.45% <2.45%> (ø)

... and 25 files with indirect coverage changes

Signed-off-by: Uri Yagelnik <[email protected]>
@uriyage uriyage force-pushed the commands-prefetching branch from ddb9ab2 to f821c86 Compare August 1, 2024 13:50
src/dict.h Outdated
@@ -45,7 +45,8 @@
#define DICT_ERR 1

/* Hash table parameters */
#define HASHTABLE_MIN_FILL 8 /* Minimal hash table fill 12.5%(100/8) */
#define HASHTABLE_MIN_FILL 8 /* Minimal hash table fill 12.5%(100/8) */
#define DictMaxPrefetchSize 16 /* Limit of maximum number of dict entries to prefetch */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did we reach to this number? Should we provide a config for this to modify (mutable/immutable) based on the workload?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a heuristic, as is the entire feature. We found it to be optimal for GET commands with a value size of 512 bytes. However, the optimal setting may vary depending on the client load, command execution duration, and number of keys per command. I'm not sure about making this configuration modifiable, as determining an optimal value is complex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general concern I have is how much this might vary between different architectures. I know some of this comes from us fitting to the AWS graviton instances, but it could be very different from intel/amd/etc in addition to the workload specifics you mentioned. I guess I would like to see us at least set up a benchmark and test to see how widely it can vary, If the difference between most workloads is 2x, i.e. the optimal value is between 8-32, picking 16 is probably okay. If the optimal value goes between like 2 - 100, then maybe we should reconsider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will run some benchmarks, but I believe the optimal batch size depends much more on the specific workload. For example, if we run expensive commands against one big hashmap with expensive operations, we may want to decrease the batch size, as the prefetched items for the next commands may be pushed out of memory. Conversely, with MSET involving many small keys, the batch size can be larger. Maybe we should consider giving users the option to fine-tune this parameter and set the default to 16?"

src/dict.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Show resolved Hide resolved
src/server.h Show resolved Hide resolved
Signed-off-by: Uri Yagelnik <[email protected]>
@madolson
Copy link
Member

madolson commented Aug 6, 2024

@lipzhu Would be great if you could take a look as well.

utils/generate-fmtargs.py Outdated Show resolved Hide resolved
src/server.c Show resolved Hide resolved
src/server.c Outdated
@@ -5678,6 +5682,7 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) {
"io_threaded_writes_processed:%lld\r\n", server.stat_io_writes_processed,
"io_threaded_freed_objects:%lld\r\n", server.stat_io_freed_objects,
"io_threaded_poll_processed:%lld\r\n", server.stat_poll_processed_by_io_threads,
"io_threaded_average_prefetch_batch_size:%lld\r\n", average_prefetch_batch_size,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally averages are bad for info metrics because they don't give metrics that are useful "now". If the average batch size grew linearly from 1->10 for example, the current value would be 5, which is not that useful since it's not currently useful. My preference would be to expose both prefetch entries and batches so end users can compute an average on their own time frame OR change this to be a rolling average like the other instantaneous metrics.

Also, what do we see as the utility of this metric. You can't really turn this feature off, so would an end user have any idea what to do with this information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to output both stats, Should we consider adding an option for users to disable this feature?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point that disabling it also makes these more useful, and maybe more actionable as well to tune the prefetch value.

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. The dict prefetching code is a bit convoluted, mostly because of the global, so I think we can make that a bit cleaner. The other main point is it would be good to understand how we tuned the one magic number.

src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/dict.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
@madolson madolson added needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes labels Aug 6, 2024
@madolson madolson requested a review from zuiderkwast August 6, 2024 04:45
src/dict.c Outdated Show resolved Hide resolved
@lipzhu
Copy link
Contributor

lipzhu commented Aug 6, 2024

Profiling the cache hit/miss W/o prefetch, the performance boost is impressive. @uriyage Great work.

BTW, the performance improvement is based on the situation that memory latency have been the bottle-neck of main thread, do you have evaluation if memory latency is not the bottle-neck, will the additional prefetch instruction caused regression?

 # w/ prefetch
 Performance counter stats for process id '1841714':

   128,348,636,929      mem_load_retired.l1_hit                                       (66.64%)
     1,917,290,678      mem_load_retired.l1_miss                                      (66.65%)
       736,226,936      mem_load_retired.l2_hit                                       (66.67%)
     1,180,484,127      mem_load_retired.l2_miss                                      (66.71%)
       419,957,504      mem_load_retired.l3_hit                                       (66.69%)
         8,607,452      mem_load_retired.l3_miss                                      (66.65%)

      10.001272139 seconds time elapsed

 # w/o prefetch
 Performance counter stats for process id '1842658':

   118,738,848,689      mem_load_retired.l1_hit                                       (66.64%)
     1,641,809,909      mem_load_retired.l1_miss                                      (66.64%)
       592,192,759      mem_load_retired.l2_hit                                       (66.67%)
     1,049,283,088      mem_load_retired.l2_miss                                      (66.71%)
       373,710,273      mem_load_retired.l3_hit                                       (66.69%)
        31,032,840      mem_load_retired.l3_miss                                      (66.65%)

      10.001325784 seconds time elapsed

Signed-off-by: Uri Yagelnik <[email protected]>
@uriyage uriyage force-pushed the commands-prefetching branch from 2fbf659 to 9a41120 Compare August 6, 2024 23:00
@uriyage
Copy link
Contributor Author

uriyage commented Aug 7, 2024

BTW, the performance improvement is based on the situation that memory latency have been the bottle-neck of main thread, do you have evaluation if memory latency is not the bottle-neck, will the additional prefetch instruction caused regression?

@lipzhu I tested the same code with one key in the database, meaning we accessed the same key repeatedly. In this case, memory latency is not the bottleneck, and as expected prefetching didn't help. However, no regression was observed.

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM outside of some existing comments and the clang format thing. Did we decide to make the batch size configurable and/or self-tuning? Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

Also, please open the doc PR as we discussed, then we can add it to the Valkey 8 board so it doesn't get dropped and this PR is unblocked to get merged.

@lipzhu
Copy link
Contributor

lipzhu commented Aug 14, 2024

Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

+1. To make code structure cleaner maybe we can extract the prefetch related code to a new file and apply the prefetch optimization like a rule.

@uriyage
Copy link
Contributor Author

uriyage commented Aug 14, 2024

LGTM outside of some existing comments and the clang format thing. Did we decide to make the batch size configurable and/or self-tuning? Making it configurable would also allow a value of zero to be "no prefetching" which also gives users an operational knob if it's not working for any reason.

Also, please open the doc PR as we discussed, then we can add it to the Valkey 8 board so it doesn't get dropped and this PR is unblocked to get merged.

@madolson @lipzhu I agree, I will add a configuration for it. However, it will require some changes since the batch is currently static. I also agree that we can refactor the code into a new file.
I believe I'll be able to publish the code by tomorrow. @lipzhu I'm not sure I understand what you mean by 'apply like a rule'.

@lipzhu
Copy link
Contributor

lipzhu commented Aug 15, 2024

@lipzhu I'm not sure I understand what you mean by 'apply like a rule'.

Sorry for the miss understanding, I mean the Rule-Based Optimizer in DBMS, inspired by SparkSQL:spark.sql.adaptive.optimizer.excludedRules. Which give user a option to apply/exclude the prefetch rule based on their real environment.

Ref:
https://spark.apache.org/docs/latest/sql-performance-tuning.html
image

@uriyage uriyage force-pushed the commands-prefetching branch from 24ae4ad to d04a755 Compare August 18, 2024 04:49
@zuiderkwast zuiderkwast removed their request for review August 19, 2024 14:16
src/maa.c Outdated Show resolved Hide resolved
src/maa.c Outdated Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
Signed-off-by: Uri Yagelnik <[email protected]>
@madolson madolson removed the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Aug 23, 2024
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a bunch of minor wording things, everything looked good now.

src/config.c Outdated Show resolved Hide resolved
src/config.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
tests/unit/networking.tcl Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
src/memory_prefetch.c Outdated Show resolved Hide resolved
Signed-off-by: Uri Yagelnik <[email protected]>
@uriyage uriyage force-pushed the commands-prefetching branch from 3dcf709 to c319572 Compare August 25, 2024 07:18
@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 26, 2024
src/networking.c Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <[email protected]>
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, kicked off the CI again with all tests.

@madolson madolson changed the title Memory Access Amortization Improve multithreaded performance with memory prefetching Aug 27, 2024
@madolson madolson merged commit 04d76d8 into valkey-io:unstable Aug 27, 2024
56 checks passed
madolson pushed a commit that referenced this pull request Sep 2, 2024
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.

After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.

### Memory Access Amortization

Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.

Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.

We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.


**Implementation details**

When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.

---------

Signed-off-by: Uri Yagelnik <[email protected]>
madolson pushed a commit that referenced this pull request Sep 3, 2024
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.

After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.

### Memory Access Amortization

Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.

Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.

We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.


**Implementation details**

When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.

---------

Signed-off-by: Uri Yagelnik <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants